
probability that a scaffold belongs to a bin using an expectation-maximization (EM) algorithm. The program also provides statistics, including genome completeness, GC content, and genome size. Figure 8.1 shows the steps from sequencing to binning.

8.2  SHOTGUN METAGENOMIC ANALYSIS WORKFLOW

The first two steps of the metagenomic data analysis workflow are raw data acquisition and quality control. After quality control, the data can follow one of two paths: (i) de novo assembly and subsequent analysis or (ii) assembly-free analysis. In the following, we discuss these steps with a worked example.

8.2.1  Data Acquisition

Raw shotgun metagenomic data consists of sequences of the metagenomic DNA extracted from environmental or clinical samples, which usually contain several species of microbes. Depending on the sequencing technology, the data can be short reads produced by Illumina and other short-read sequencing technologies or long reads produced by either Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT). The read layout can be single end or paired end. The raw data is usually provided in FASTQ files. Many researchers upload their raw data to a database such as the NCBI SRA and make it publicly available.
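As a minimal sketch of how a FASTQ file can be inspected, the helper below counts the reads in an uncompressed file. It assumes the conventional four-line record layout (header, sequence, "+" separator, quality string) and a trailing newline at the end of the file; the function name "count_reads" is hypothetical and not part of any toolkit.

```shell
# count_reads FILE: print the number of reads in an uncompressed FASTQ file.
# Assumes four lines per record; multi-line FASTQ records are not handled.
count_reads() {
    lines=$(wc -l < "$1") || return 1
    # A line count that is not a multiple of four suggests a truncated
    # file or a layout that breaks the four-line assumption.
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $1 has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}
```

For a paired-end run, calling "count_reads" on the forward and on the reverse file should give the same number, since every read in one file has a mate in the other.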

We will download FASTQ files from the NCBI SRA to demonstrate how the analysis is conducted. The run accessions are “ERR1823587”, “ERR1823601”, and “ERR1823608”, which contain shotgun metagenomic data of human stool samples from a healthy individual, a patient with moderate sickle cell disease, and a patient with severe sickle cell disease, respectively. We will create the directory “shotgun” as the project working directory; then, we will use the SRA Toolkit's “fasterq-dump” utility to download the paired-end files into a directory called “fastqdir”.

mkdir shotgun; cd shotgun

fasterq-dump --threads 4 --verbose --outdir fastqdir ERR1823587

fasterq-dump --threads 4 --verbose --outdir fastqdir ERR1823601

fasterq-dump --threads 4 --verbose --outdir fastqdir ERR1823608

Six FASTQ files will be saved in “fastqdir”, two per paired-end run. Use “ls fastqdir/” to list the contents of that directory and confirm that the files are there.
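The download and the check can also be scripted. The sketch below wraps the same "fasterq-dump" options in a loop and then counts missing mate files. The function names "download_runs" and "count_missing_mates" are hypothetical, and the file-name pattern "<run>_1.fastq" / "<run>_2.fastq" is an assumption about how fasterq-dump names the two files of a paired-end run by default; adjust it if your files are named differently.

```shell
# Run accessions taken from the text above.
RUNS="ERR1823587 ERR1823601 ERR1823608"

# download_runs OUTDIR RUN...: call fasterq-dump once per accession,
# using the same options as the individual commands above.
download_runs() {
    outdir=$1; shift
    for run in "$@"; do
        fasterq-dump --threads 4 --verbose --outdir "$outdir" "$run" || return 1
    done
}

# count_missing_mates DIR RUN...: print how many expected mate files are
# absent or empty in DIR. Assumes the names <run>_1.fastq and <run>_2.fastq.
count_missing_mates() {
    dir=$1; shift
    missing=0
    for run in "$@"; do
        for mate in 1 2; do
            file="$dir/${run}_${mate}.fastq"
            if [ ! -s "$file" ]; then
                echo "missing or empty: $file" >&2
                missing=$((missing + 1))
            fi
        done
    done
    echo "$missing"
}
```

From inside the “shotgun” directory, one might run "download_runs fastqdir $RUNS" followed by "count_missing_mates fastqdir $RUNS"; a printed count of 0 means that both files were found for every run under the naming assumption above.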

8.2.2  Quality Assessment and Processing

If you obtained these files directly from the sequencer, they will likely need quality control, which includes both quality assessment with a program such as FastQC and processing to filter out low-quality reads, to trim low-quality

FIGURE 8.1  Reads processing from sequencing to bin formation. [Figure labels: Sequencing, Reads, Assembly, Contigs, Binning, Bin 1, Bin 2, Bin 3, Bin 4.]